Preprocessing

What filtering thresholds did you choose and how did you decide on them? Upon consulting the original publication, I decided to filter out cells with fewer than 500 expressed genes, cells with fewer than 800 UMIs, and cells with mitochondrial percentage greater than 10% for the sake of reproduction. These are pretty permissive thresholds as they keep in the outliers on the higher end of the expressed genes and UMI distributions which I probably would have filtered out. However, tt makes sense to use consistent filtering threshold across all samples which is another reason I decided to replicate this reasoning from the original publication.

How many cells / genes are present before and after implementing your filtering thresholds?

load("qc_summary.RData")
qc_summary <- rbind(
  qc_summary,
  data.frame(
    Sample = "Total",
    Cells_Before = sum(qc_summary$Cells_Before, na.rm = TRUE),
    Genes_Before = sum(qc_summary$Genes_Before, na.rm = TRUE),
    Cells_After = sum(qc_summary$Cells_After, na.rm = TRUE),
    Genes_After = sum(qc_summary$Genes_After, na.rm = TRUE)
  )
)

qc_summary
##             Sample Cells_Before Genes_Before Cells_After Genes_After
## 1  HB17_background        36223        21238        1683       21238
## 2         HB17_PDX        70643        26492       67005       26492
## 3       HB17_tumor        73238        26552       67219       26552
## 4         HB30_PDX        86723        26660       22965       26660
## 5       HB30_tumor        85180        27350       28727       27350
## 6  HB53_background        79374        26247       47873       26247
## 7       HB53_tumor        85108        27456       27791       27456
## 8            Total       516489       181995      263263      181995
## 9            Total      1032978       363990      526526      363990
## 10           Total      2065956       727980     1053052      727980

There were a total of 1,032,978 cells and 363,990 genes across all samples before filtering and doublet removal. After filtering there were 526,526 cells and 363,990 genes. The individual sample before and after filtering numbers can be seen in the table above.

Look in the literature, what are some potential strategies to set thresholds that don’t rely on visual inspection of plots? There is MAD to identify outliers in QC metrics like number of genes, UMIs, or percent mitochondrial reads. This method uses Median Absolute Deviation and captures the central distribution while removing statistical outliers.Tools like CellBender which is a deep-learning based tool, automatically identify ambient RNA, empty droplets, and technical artifacts.

Preprocessing Figures

Counts per sample

### Integrated Violin ### Doublet Detection

##Normalization I am using Seurat’s default global-scaling normalization method, which normalizes the gene expression measurements for each cell by the total expression, multiplies by a scale factor (10,000 by default), and then log-transforms the result. This procedure helps account for differences in sequencing depth across cells.

Feature Selection

Seurat “vst” method was used with standard options which selects the top 2,000 highly variable features So for my downstream analysis, I am including 2000 highly variable genes and excluding (28992 - 2000) genes. For a dataset this size, choosing 2000 highly variable features is reasonable. Seurat’s default method doesn’t use an absolute threshold (like a fixed variance value). It ranks all genes by their standardized variance (after accounting for mean expression), and then selects the top n. In my case, I went with Seurats standard top n of 2000.

Feature Selection Plot

## PCA Based on the elbow plot below I decided to go with 20 PCs for the downstream analysis. ### PCA Figures

Clustering and Visualization

I used the standard Seurat settings to Find Neighbors and Find Clusters. I used 20 PCS and a resolution of 0.5 to get THIS MANY clusters. How many cells come from each sample individually? How many total cells present in the entire dataset? Based on the UMAP plots I created, I decided that integration is needed. When labeling by sample ID, the data was clustering on sample and so not well integrated. It made sense that the data was clustering a lot on type but for sample the data should not have been splitting on it.

###Clustering and Visualization plots

Integration

I used Harmony to perform the integration. After running the UMAP again, I saw that though the sample still drove some of the separation there was a lot more overlap which made me feel better about the biological significance of the clusters and their annotations going forward.

###Integration Plots

Marker Gene Analysis

Briefly describe the method performed to identify marker genes. Discuss a few advantages and disadvantages of the method used to perform marker gene analysis.

I used Seurats FindAllMarkers to indentify marker genes. The function identifies marker genes by performing differential expression testing between each cluster and all other cells, typically using the Wilcoxon rank-sum test by default.The advantage of this method is that it doesnt assume normality but a disadvantage is that it is computationally intensive and is sensitive to clustering quality.

###Marker Gene Analysis Figures

##            p_val avg_log2FC pct.1 pct.2     p_val_adj cluster        gene
## 1   0.000000e+00   4.210278 0.561 0.082  0.000000e+00       0        CST4
## 2   0.000000e+00   2.472077 0.464 0.183  0.000000e+00       0    SERPINE2
## 3   0.000000e+00   2.548342 0.313 0.121  0.000000e+00       0       STPG2
## 4   0.000000e+00   2.736698 0.305 0.115  0.000000e+00       0        ARSE
## 5   0.000000e+00   2.383730 0.307 0.140  0.000000e+00       0       KITLG
## 6   0.000000e+00   2.866561 0.790 0.313  0.000000e+00       1       PEG10
## 7   0.000000e+00   3.766502 0.697 0.233  0.000000e+00       1         AFP
## 8   0.000000e+00   2.990434 0.391 0.124  0.000000e+00       1         HBB
## 9   0.000000e+00   3.679762 0.322 0.082  0.000000e+00       1     SLC35F1
## 10  0.000000e+00   3.906108 0.270 0.065  0.000000e+00       1      RPS4Y1
## 11  0.000000e+00   6.867828 0.530 0.006  0.000000e+00      10     MSC-AS1
## 12  0.000000e+00   6.805759 0.362 0.006  0.000000e+00      10      PLA2G5
## 13  0.000000e+00   7.102581 0.355 0.004  0.000000e+00      10      PKNOX2
## 14  0.000000e+00   6.804875 0.362 0.012  0.000000e+00      10      MGAT4C
## 15  0.000000e+00   6.890617 0.316 0.004  0.000000e+00      10         GEM
## 16  0.000000e+00   6.925527 0.665 0.012  0.000000e+00      11       CD247
## 17  0.000000e+00   6.493208 0.576 0.018  0.000000e+00      11      CARD11
## 18  0.000000e+00   6.459907 0.456 0.009  0.000000e+00      11     SLFN12L
## 19  0.000000e+00   6.769585 0.349 0.005  0.000000e+00      11       RUNX3
## 20  0.000000e+00   7.366675 0.327 0.004  0.000000e+00      11      BCL11B
## 21  0.000000e+00   2.446426 0.548 0.112  0.000000e+00      12       DPP10
## 22  0.000000e+00   2.527509 0.470 0.037  0.000000e+00      12    MIR646HG
## 23  0.000000e+00   3.061346 0.434 0.031  0.000000e+00      12  AC114812.2
## 24  0.000000e+00   2.497557 0.359 0.028  0.000000e+00      12   LINC01446
## 25  0.000000e+00   2.663045 0.255 0.009  0.000000e+00      12       TSHZ3
## 26  0.000000e+00   3.637234 0.608 0.036  0.000000e+00      13       GRIK2
## 27  0.000000e+00   3.985827 0.598 0.029  0.000000e+00      13        PLD5
## 28  0.000000e+00   4.118178 0.368 0.008  0.000000e+00      13        VWA2
## 29  0.000000e+00   4.212434 0.317 0.008  0.000000e+00      13  AL591686.2
## 30  0.000000e+00   3.675575 0.273 0.008  0.000000e+00      13 CCDC148-AS1
## 31  0.000000e+00   4.086553 0.814 0.044  0.000000e+00      14        AVIL
## 32  0.000000e+00   4.683535 0.779 0.022  0.000000e+00      14       ABCB5
## 33  0.000000e+00   4.194562 0.694 0.022  0.000000e+00      14       LTBP4
## 34  0.000000e+00   4.127928 0.550 0.020  0.000000e+00      14  ST6GALNAC5
## 35  0.000000e+00   4.457319 0.524 0.013  0.000000e+00      14    GNAS-AS1
## 36  0.000000e+00   7.931420 0.495 0.001  0.000000e+00      15  AL049629.1
## 37  0.000000e+00   7.944490 0.495 0.001  0.000000e+00      15  AC245041.2
## 38  0.000000e+00   7.382536 0.454 0.002  0.000000e+00      15      SLC5A1
## 39  0.000000e+00   7.387989 0.340 0.001  0.000000e+00      15      PRSS22
## 40  0.000000e+00   7.959469 0.289 0.001  0.000000e+00      15   LINC00842
## 41  0.000000e+00   4.403750 0.917 0.027  0.000000e+00      16  AC018467.1
## 42  0.000000e+00   4.179339 0.833 0.017  0.000000e+00      16   LINC01488
## 43  0.000000e+00   4.452957 0.683 0.009  0.000000e+00      16  AP000424.1
## 44  0.000000e+00   4.311117 0.500 0.009  0.000000e+00      16  AC011498.4
## 45  0.000000e+00   4.384808 0.383 0.006  0.000000e+00      16  AP003555.1
## 46 3.361555e-210   6.670709 1.000 0.040 9.745822e-206      17         TAT
## 47 2.422180e-107   5.321169 0.459 0.015 7.022383e-103      17        SAA4
## 48  1.474614e-97   4.985658 0.622 0.031  4.275201e-93      17     CYP4A11
## 49  2.801567e-79   5.268070 0.297 0.009  8.122303e-75      17        APOF
## 50  2.295483e-61   4.969043 0.973 0.143  6.655064e-57      17        ASS1
## 51  0.000000e+00   4.927727 0.971 0.045  0.000000e+00       2         CRP
## 52  0.000000e+00   5.272700 0.851 0.050  0.000000e+00       2        NNMT
## 53  0.000000e+00   5.050027 0.327 0.028  0.000000e+00       2      CYP2C8
## 54  0.000000e+00   5.752337 0.304 0.025  0.000000e+00       2      CYP2A6
## 55  0.000000e+00   5.530231 0.300 0.024  0.000000e+00       2      CYP2A7
## 56  0.000000e+00   3.498518 0.552 0.039  0.000000e+00       3   LINC00470
## 57  0.000000e+00   3.333775 0.559 0.069  0.000000e+00       3        GPC5
## 58  0.000000e+00   3.733413 0.512 0.048  0.000000e+00       3  AC098617.1
## 59  0.000000e+00   3.715704 0.434 0.031  0.000000e+00       3      TMEFF2
## 60  0.000000e+00   3.869333 0.256 0.010  0.000000e+00       3      CROCC2
## 61  0.000000e+00   1.946839 0.852 0.301  0.000000e+00       4    KCNQ1OT1
## 62  0.000000e+00   2.263821 0.606 0.134  0.000000e+00       4       KCNQ1
## 63  0.000000e+00   1.948869 0.437 0.092  0.000000e+00       4  AC022146.2
## 64  0.000000e+00   2.133614 0.305 0.054  0.000000e+00       4      PRDM16
## 65  0.000000e+00   1.957344 0.250 0.046  0.000000e+00       4     RTN4RL1
## 66  0.000000e+00   3.153081 0.401 0.013  0.000000e+00       5     C2orf48
## 67  0.000000e+00   2.669669 0.355 0.015  0.000000e+00       5      IQGAP3
## 68  0.000000e+00   2.722166 0.324 0.013  0.000000e+00       5       CDC45
## 69  0.000000e+00   2.760342 0.300 0.022  0.000000e+00       5  AC007608.1
## 70  0.000000e+00   2.911755 0.267 0.015  0.000000e+00       5  AC079594.2
## 71  0.000000e+00   3.972423 0.850 0.055  0.000000e+00       6  AP002518.2
## 72  0.000000e+00   3.899813 0.711 0.017  0.000000e+00       6        DTX1
## 73  0.000000e+00   4.037068 0.611 0.021  0.000000e+00       6      SLFNL1
## 74  0.000000e+00   4.164398 0.593 0.011  0.000000e+00       6  AC012613.1
## 75  0.000000e+00   4.209566 0.250 0.003  0.000000e+00       6  AC011498.4
## 76  0.000000e+00   4.485375 0.447 0.023  0.000000e+00       7       EPHA6
## 77  0.000000e+00   4.587825 0.426 0.024  0.000000e+00       7         NTM
## 78  0.000000e+00   4.528960 0.365 0.022  0.000000e+00       7        PLD5
## 79  0.000000e+00   4.176892 0.364 0.030  0.000000e+00       7       GRIK2
## 80  0.000000e+00   4.762748 0.251 0.007  0.000000e+00       7  AC019068.1
## 81  0.000000e+00   5.766574 0.469 0.010  0.000000e+00       8       PDE2A
## 82  0.000000e+00   5.310030 0.463 0.013  0.000000e+00       8        TOX2
## 83  0.000000e+00   5.217499 0.460 0.021  0.000000e+00       8       STAB1
## 84  0.000000e+00   5.316983 0.368 0.007  0.000000e+00       8      NOTCH4
## 85  0.000000e+00   5.130017 0.259 0.013  0.000000e+00       8  AC011476.3
## 86  0.000000e+00   5.556088 0.482 0.008  0.000000e+00       9        FGD2
## 87  0.000000e+00   6.078418 0.421 0.008  0.000000e+00       9      IGSF21
## 88  0.000000e+00   5.372153 0.370 0.006  0.000000e+00       9       WDFY4
## 89  0.000000e+00   5.836110 0.348 0.006  0.000000e+00       9       TRPM2
## 90  0.000000e+00   5.557584 0.319 0.004  0.000000e+00       9       CYTH4

Automatic Annotation of Cell Labels

I used SingleR to perform do automatic annotation of cell labels. The algorithm uses reference transcriptomic datasets of pure cell types to infer the cell origin. Based on the figure below most of the cell identities appear to be Hepatocytes and erythroblasts which are cell types found in the liver. This makes sense as the original publication is about identifying tumor cell populations in the hepatoblastoma environment.

Citation: Aran D, Looney AP, Liu L, Wu E, Fong V, Hsu A, Chak S, Naikawadi RP, Wolters PJ, Abate AR, Butte AJ, Bhattacharya M (2019). “Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage.” Nat. Immunol., 20, 163-172. doi:10.1038/s41590-018-0276-y.

Automatic Annotation of Cell Labels Figures

## Manual Gene Annotation Sources 0:Uhlén M et al., Tissue-based map of the human proteome. Science (2015) PubMed: 25613900 DOI: 10.1126/science.1260419 1: AFP Sauzay, Chloé, et al. “Alpha-foetoprotein (AFP): A multi-purpose marker in hepatocellular carcinoma.” Clinica chimica acta 463 (2016): 39-44. 2: CYP2C8 Uhlén M et al., Tissue-based map of the human proteome. Science (2015) PubMed: 25613900 DOI: 10.1126/science.1260419 3: GCP5 Bosworth AP, Contreras M, Sancho L, Salas IH, Paumier A, Novak SW, Manor U, Allen NJ. Astrocyte glypican 5 regulates synapse maturation and stabilization. Cell Rep. 2025 Mar 25;44(3):115374. doi: 10.1016/j.celrep.2025.115374. Epub 2025 Mar 5. PMID: 40048429; PMCID: PMC12013928. 4: RTN4RL1 Uhlén M et al., Tissue-based map of the human proteome. Science (2015) PubMed: 25613900 DOI: 10.1126/science.1260419 5: C2orf48 IQGAP3 CDC45 Uhlén M et al., Tissue-based map of the human proteome. Science (2015) PubMed: 25613900 DOI: 10.1126/science.1260419 6: DTX4 Wang, Zonggui, et al. “E3 ubiquitin ligase DTX4 is required for adipogenic differentiation in 3T3-L1 preadipocytes cell line.” Biochemical and Biophysical Research Communications 492.3 (2017): 419-424. 7: PLD5 Uhlén M et al., Tissue-based map of the human proteome. Science (2015) PubMed: 25613900 DOI: 10.1126/science.1260419 8: NOTCH4 Lai, Peng-Yeh, Chong-Bin Tsai, and Min-Jen Tseng. “Active form Notch4 promotes the proliferation and differentiation of 3T3-L1 preadipocytes.” Biochemical and biophysical research communications 430.3 (2013): 1132-1139. 9: FGD2 Huber, Christoph, et al. “FGD2, a CDC42-specific exchange factor expressed by antigen-presenting cells, localizes to early endosomes and active membrane ruffles.” Journal of Biological Chemistry 283.49 (2008): 34002-34012. 10:PLA2G5 Uhlén M et al., Tissue-based map of the human proteome. Science (2015) PubMed: 25613900 DOI: 10.1126/science.1260419 11: CD247 Eldor, Roy, et al. “CD247, a Novel T Cell–Derived Diagnostic and Prognostic Biomarker for Detecting Disease Progression and Severity in Patients With Type 2 Diabetes.” Diabetes care 38.1 (2015): 113-118. 12:Caubit, Xavier, et al. “TSHZ3 deletion causes an autism syndrome and defects in cortical projection neurons.” Nature genetics 48.11 (2016): 1359-1369.

My strategy for naming the clusters was starting with the Human Protein Atlas which is cited for several of the clusters above. I did a search for each of the five top genes in the cluster in the Human Protein Atlas and specifically looked for cell type expression cluster and tissue specificity in the search. If there was a hit for the gene (not all genes were found) and some some enrichement in a specific cell type or tissue, I did a second search of the gene in PubMed to confirm if the two terms appeared together. I then went through the rest of the two 5 genes in order to get a sense of a consensus. If I were doing this for a publication I would probably cite a paper for each gene that I searched up which I did not do in this case - I either cited the Human Protein Atlas or a publication with the one of the genes.

Manual Cluster Labeling Figures

Replicate Figures

Below is Figure 3B replicated. This figure highlights marker genes used to identify the common cell population in liver and tumor samples. Although the shapes of our UMAPs differ - the general down regulation and upregulation of the markers across the plot between the publication and my reproduction of the figure is somewhat consistent.

Below is Figure 4A replicated

This figure plots mean normalized expression values across the three types of cells that are in this analysis: background hepatocytes, tumor cells from primary hepatoblastoma (HB), and tumor cells from patient-derived xenograft (PDX) models. This involved, mean-averaging gene expression per cell type, normalizing to remove baseline shifts, comparing expression trends across types and then quantifying the correlation with R2.The R squared values between the publication and my findings here are different but for PDX tumor cells and tumor cells the highest correlation remains with the published R squared at 0.902 and mine at 0.899 which is reassuring.

Additional Analysis

I performed Pseudotime analysis using Monocle3 to infer the transcriptional trajectory from background hepatocytes to tumor and PDX-derived cells. This approach can model how cellular identities transition over time and assess whether PDX cells faithfully recapitulate the progression observed in primary tumors. Additionally, by ordering cells along this trajectory it may be possible to identify regulated genes and potential regulators of tumor transformation.